1. INTRODUCTION 

During the COVID-19 pandemic, face masks became a necessary part of a person’s attire. Despite 
their preventative advantages, they have some negative impacts on speaking and hearing. Physically, all face 
masks have some effects on the components of the human speaking systems. Wearing a face mask reduces 
the movability of the lips, the oral cavity, the jaws, and the tongue. Face masks act as a physical obstacle 
against the airflow of the mouth and the nose. Face mask-wearing has subjectively changed the audio 
features of human speech. 

According to Fourier analysis, any periodic function can be expressed as an infinite series of 
trigonometric functions (sines and cosines). The frequencies of these functions are discrete, integer multiples 
of the lowest one. The first frequency (harmonic) is called the fundamental and typically has the greatest 
magnitude. The waveform of speech is not completely periodic, but it is nearly periodic over short intervals. 
Researchers in audio, language, and speech signal processing have exploited this property efficiently. 
For speech, this first frequency is called the pitch and is abbreviated as F0. F0 has the greatest energy among the 
frequency components within analysis periods that are multiples of 10 ms, as computed by the short-time 
discrete Fourier transform (STDFT). The F0 detection process depends on the category of the speech in that 
period: voiced or unvoiced. Generally, pitch is the perceptual correlate of the fundamental frequency F0 of an 
audio signal. Perceiving F0 is important as a general attribute for comparing sounds with various timbres, 
because of the possibility of errors in hearing. In addition, because F0 is a compact measurement for a group of 
frequency harmonics, it efficiently provides an abstract representation of audio for human memory 
storage [1]. 

To estimate F0, pitch detection algorithms (PDAs) are utilized. The estimated F0 ranges 
between 40 Hz and 600 Hz. Usually, the F0 of females is higher than that of males [2]. Many algorithms 
have been proposed to detect the pitch F0 of audio and speech signals. Typically, a PDA detects and 
measures the period of the quasiperiodic speech or audio signal, and then takes the reciprocal of the 
measured value to obtain the pitch frequency F0. Some PDAs (e.g., ones based on autocorrelation) 
need two or more F0 periods (about 50 ms) to estimate the pitch. There are three main approaches for PDAs: 
time-domain, e.g., YIN (named after the “yin” of yin and yang in oriental philosophy) [3]; frequency-domain, 
e.g., the Tolonen and Karjalainen algorithm (TK) [4]; and the spectral/temporal approach, e.g., yet another 
algorithm for pitch tracking (YAAPT) [5]. 
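
As a rough illustration of the period-then-reciprocal idea behind autocorrelation-based PDAs, the following minimal sketch searches the 40 Hz to 600 Hz lag range of one frame and returns F0 as the sampling rate divided by the best lag. The function name, the 0.95 relative threshold, and the rule of taking the first strong local peak (to avoid picking a multiple of the period) are illustrative assumptions, not the method of any cited algorithm.

```python
import math

def estimate_f0_autocorr(frame, fs, f0_min=40.0, f0_max=600.0, rel_threshold=0.95):
    """Estimate F0 (Hz) of one frame from its normalized autocorrelation."""
    lag_min = int(fs / f0_max)          # shortest period searched, in samples
    lag_max = int(fs / f0_min)          # longest period searched, in samples
    n = len(frame) - lag_max            # number of products summed per lag
    if n <= 0:
        raise ValueError("frame must be longer than the longest lag")
    e0 = sum(x * x for x in frame[:n])
    scores = {}
    for lag in range(lag_min, lag_max + 1):
        cross = sum(frame[i] * frame[i + lag] for i in range(n))
        e_lag = sum(x * x for x in frame[lag:lag + n])
        denom = math.sqrt(e0 * e_lag)
        scores[lag] = cross / denom if denom > 0 else 0.0
    best = max(scores.values())
    # Assumption: take the first strong local peak, so that a multiple of
    # the true period is not preferred over the period itself.
    for lag in range(lag_min + 1, lag_max):
        is_peak = scores[lag] >= scores[lag - 1] and scores[lag] > scores[lag + 1]
        if is_peak and scores[lag] >= rel_threshold * best:
            return fs / lag             # reciprocal of the period
    return fs / max(scores, key=scores.get)
```

With a 50 ms frame at 16 kHz the frame holds 800 samples, which covers the longest lag (400 samples) plus an equally long summation window.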

Efficient PDA algorithms and applications include the following. The PRAAT application (the name is 
the imperative form of “to speak” in Dutch) was presented by Boersma [6] to detect speech signal periodicity 
robustly and directly from the autocorrelation in the lag domain. In tests, PRAAT showed clear immunity 
against jitter and additive noise on periodic signals. For speech signal analysis, PRAAT is reported to be orders 
of magnitude more accurate than the commonly used methods. PRAAT measures the harmonics-to-noise ratio 
(HNR) in the lag domain. The measurements are reliable and accurate when compared with traditional 
frequency-domain approaches. An online open-source PRAAT application is available [7]. Sun [8] proposed 
the SHRP algorithm, which finds F0 from the sub-harmonic-to-harmonic ratio (SHR). To handle alternate 
cycles of speech, the pitch is estimated via spectrum shifting on a logarithmic frequency scale, and the SHR is 
calculated. The algorithm was evaluated on two databases, where SHRP outperformed other PDAs. An online 
open-source SHRP Matlab toolbox is available. Noll [9] adapted maximum likelihood with a cepstral analysis 
of the product of frequency components to detect F0; the harmonics are matched to pre-defined spectrum 
schemes. Gruber [10] investigated the possibility of polyphonic spectrum detection, using a periodogram to 
transform the time-domain waveform and detect the harmonic spectrum. Brown and Puckette [11] derived an 
improvement of F0 detection from the discrete cosine transform (DCT) spectrum using the phase. 
The short-time Fourier transform (STFT) bins can be used to increase the accuracy of harmonic re-assignment 
using the phase; in addition to the phase, the magnitude is used to increase the accuracy. In the YAAPT of 
Zahorian et al. [5], time-domain tracking is performed with a normalized cross-correlation. In the 
frequency domain, the researchers used spectral attributes to find the pitch precisely. Tolonen and Karjalainen [4] 
suggested the TK algorithm to find multiple F0s by analyzing the periodicity of the speech signal. They partitioned 
the signal into two bands, below and above 1 kHz, and computed the summary autocorrelation function 
(SACF) and the enhanced summary autocorrelation function (ESACF) of the two band signals to detect the 
speech signal periodicity. Medan et al. [12] derived the super-resolution pitch detector (SRPD). The procedure is 
based on the similarity of speech excitation techniques. It is reported to have an infinite resolution, greater 
accuracy for F0, robustness against noise, more reliability, and less computational complexity. The procedure 
is applicable to speech processing that needs F0-synchronous spectral analysis. Kawahara et al. [13] introduced 
the speech transformation and representation using adaptive interpolation of weighted spectrum (STRAIGHT) 
paradigm, in which time-frequency analysis is used with group delay and instantaneous frequency. 
To extract signal attributes, the paradigm measures the aperiodicity in the frequency domain and the energy 
concentration in the time domain. A modified method minimizes the perceptual disturbances caused by 
errors in the extracted attributes. Camacho and Harris [14] developed the sawtooth waveform inspired pitch 
estimator (SWIPE) for music and speech, which detects F0 as the first harmonic (fundamental) of the 
sawtooth signal whose spectrum best matches the spectrum of the input speech signal. Its decaying cosine 
kernel extends earlier frequency-based, sieve-type detection algorithms by giving smooth peaks with 
amplitudes that decay with the harmonics of the signal. 

Robust algorithm for pitch tracking (RAPT) 

A well-known PDA is RAPT. The algorithm was introduced by Talkin and Kleijn [15] 
and is based on cross-correlation. The main improvement of this estimator is the reduction of computational 
complexity. The sequential outline of the RAPT algorithm is: 

1) Provide the speech samples at their original sampling rate and at a reduced sampling rate. 

2) Periodically compute the normalized cross-correlation function (NCCF) of the reduced-sampling-rate 
speech signal for lags in the F0 range. 

3) Locate the maxima of this first-pass NCCF. 

4) In the vicinity of the peaks found in the first pass, compute the NCCF at the original sampling rate. 

5) Find the maxima of this second NCCF, and obtain the location and amplitude of each refined peak. 

6) For each peak obtained from the high-resolution NCCF, estimate a candidate F0 for the processed frame. 

7) Advance an unvoiced/voiced hypothesis for each frame. 

8) Use an optimization process to find, over all frames, the set of NCCF peaks or unvoiced hypotheses 
that best matches the above characteristics. 

Compared with the well-known speech pitch tracking algorithm (PTA), RAPT has the following differences: 

— PTA computes the NCCF in the linear prediction coding (LPC) domain. RAPT computes the NCCF on the 
original speech signal. 

— Two stages of NCCF are used to reduce the overall computational load. There is a similarity 
between RAPT and the simplified inverse filtering technique (SIFT) in the double-stage NCCF 
computation. To increase accuracy, RAPT interpolates the maxima of the samples at the high 
sampling rate. 
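
The two-pass NCCF search in steps 1 to 5 can be sketched for a single frame as follows. This is a simplified illustration under stated assumptions, not Talkin's implementation: the decimation is naive (no low-pass filter), only one peak is kept, the linear lag taper of 0.3 is applied only in the coarse pass, and the voicing hypotheses and the optimization of steps 7 and 8 are omitted. The names `nccf` and `two_pass_peak` are hypothetical.

```python
import math

def nccf(signal, start, window, lag):
    """Normalized cross-correlation of a window with its copy shifted by lag."""
    a = signal[start:start + window]
    b = signal[start + lag:start + lag + window]
    cross = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return cross / denom if denom > 0 else 0.0

def two_pass_peak(signal, fs, window_s=0.0075, f0_min=50.0, f0_max=500.0,
                  factor=4, taper=0.3):
    """Return (lag, NCCF peak value, F0 in Hz) for the frame starting at 0."""
    # Pass 1: coarse search at the reduced sampling rate.
    coarse = signal[::factor]              # assumption: naive decimation
    fs_c = fs / factor
    w_c = round(window_s * fs_c)
    k_min = max(1, int(fs_c / f0_max))
    k_max = int(fs_c / f0_min)

    def weighted(k):
        # Linear lag taper: slightly favors shorter lags (fewer octave errors).
        return nccf(coarse, 0, w_c, k) * (1.0 - taper * k / k_max)

    coarse_lag = max(range(k_min, k_max + 1), key=weighted)
    # Pass 2: refine near the coarse peak at the original sampling rate.
    w = round(window_s * fs)
    center = coarse_lag * factor
    lags = range(max(1, center - factor), center + factor + 1)
    fine_lag = max(lags, key=lambda k: nccf(signal, 0, w, k))
    return fine_lag, nccf(signal, 0, w, fine_lag), fs / fine_lag
```

Only the lags around the coarse peak are evaluated at the full rate, which is where the reduction in computational load comes from.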

The Kaldi pitch tracker is an improved RAPT [16]. The performance on tonal languages improved 
when the Kaldi F0 tracker, with its “probability-of-voicing” estimator, was used in automatic speech recognition 
(ASR). The original RAPT makes a hard decision, voiced or unvoiced, for each frame. The Kaldi F0 tracker 
assigns a pitch even to unvoiced frames while constraining each pitch trajectory to be continuous [17]. 


2. UNMASKED AND MASKED FACE SPEECH 

For Arabic speech, there is a lack of research and literature focusing on the acoustic effects of face 
masks. The reasons are the short period of the pandemic, the small number of mask-wearing people 
before the pandemic, and the practical difficulty of conducting research on the subject. The following 
references support this research: 

Atcherson et al. [18] found a clear difference between the spectral analyses of speech stimuli with 
and without masks. The root mean square (RMS) value of that difference is about 2 dB. The national health 
service (NHS) listeners performed consistently across the different test conditions. Transparent masks 
provide the benefit of visual input for listener groups with hearing impairment. The magnitude of the 
improvement in speech perception in noise was greatest for the group with severe-to-profound hearing loss. 

Corey et al. [19] noted that face masks muffle speech and make communication more difficult for 
people with hearing loss. They examined the acoustic attenuation caused by different types of face masks, 
using a human talker and a head-shaped loudspeaker. All masks attenuated speech at frequencies above 
1 kHz, and the greatest attenuation occurred in front of the speaker. Among cloth masks, there was substantial 
variation due to weave and material type. Compared to cloth and medical masks, transparent masks had 
poor acoustic performance. Most masks had a negligible effect on lapel microphones. The researchers 
suggested that existing sound reinforcement and assistive listening systems are useful for masked verbal 
communication. 

Deshpande and Schuller [20] summarized the research community's efforts to help society and 
individuals against the pandemic through speech and audio digital signal processing. Deep learning techniques 
are summarized as contributions to short-term solutions. The article also gives an overview of contributions 
from non-speech modalities, which complement or serve as inspiration for speech and/or audio analysis. 
The researchers discussed the observations along with challenges, feasible solutions, and the achievements of 
significant technologies. 

Ribeiro et al. [21] discussed the following difficulties due to face masks: speech intelligibility, vocal 
effort perception, auditory feedback, and speech coordination. The researchers concluded that, for necessary 
and professional activities, wearing a face mask increased the perception of vocal effort, of symptoms of 
discomfort and vocal fatigue, and of difficulty with speech intelligibility and speech coordination. 

Bottalico et al. [22] studied the effects of face mask-wearing on communication in a classroom. 
They evaluated the variations in speech intelligibility caused by traditional face masks (N95, medical/surgical, 
and fabric). The speech was presented to students in auralized classrooms. Classroom conditions were 
simulated realistically with reverberation times of 0.4 s and 3.1 s. The speech stimuli were presented in 
speech-shaped noise at a 3 dB signal-to-noise ratio (SNR). Fabric masks yielded a greater drop in speech 
intelligibility than N95 and medical/surgical masks. For teaching environments, the authors recommend N95 
and/or medical/surgical masks. 

Das and Li [23] studied audio features using linear filter-banks. The instantaneous phase and 
long-term attributes capture the artifacts that distinguish speech with a face mask from speech without one. 
The extracted features were used alongside the following toolkits: deepspectrum, bag-of-audio-words, 
audeep [24], and the computational paralinguistic evaluation functionals (ComParE). They demonstrated the 
capability of the audio features, and score-level fusion with the ComParE2020 baselines produced an average 
recall of 73.5% on the test set. 

Thibodeau et al. [25] investigated auditory-visual recognition in a noisy background for 154 talkers 
with and without opaque and transparent masks, extending their smaller study with 29 talkers. The observed 
differences between opaque and transparent masks were attributed to acoustic differences and visual gestures. 
Listeners took part in online sessions in a quiet room, listening for 40 minutes via listening devices such as 
assistive hearing aids and earbuds. The talkers had normal hearing or suspected or confirmed hearing loss, 
and they participated with and without listening assistance devices. 

Nguyen et al. [26] compared speech measurements from recordings of 16 adults with a KN95 mask, 
with a medical mask, and without a mask. For two bands, below 1 kHz and from 1 kHz to 8 kHz, the 
researchers analyzed the average spectral levels, the energy ratio of the first band to the second, the HNR, 
the vocal intensity, and the smoothed cepstral peak prominence (CPPS). There was an obvious attenuation of 
the average spectral level in the 1-8 kHz band, whereas the attenuation was negligible below 1 kHz. For face 
masked speech, the vowel average spectral levels changed little. The HNR was greater for face masked 
speech than for unmasked speech. Mask-wearing did not affect the vocal intensity and CPPS much. 

Cohn et al. [27] tested the influence of a fabric face mask on speech comprehension. Three styles of 
speech (clear, casual, and positive-emotional) were compared with and without masks, and listeners tested 
the speech subjectively. In word identification, accuracy was high for clear speech, low for casual speech, 
and moderate for emotional speech. For the clear style, face masked speech had greater intelligibility than 
unmasked speech. For the emotional style, face masked speech had lower intelligibility than unmasked 
speech. No significant difference was observed for the casual style. This may imply that the emotional and 
casual styles carried less of an intent to be understood clearly by the listeners. 

Toscano and Toscano [28] studied the effects of N95 respirators, surgical masks, and cloth masks on 
speech recognition in multi-talker babble. In quieter backgrounds, masks had insignificant effects, with less 
than a 5.5% reduction in accuracy relative to unmasked conditions. In backgrounds with higher noise levels, 
the average accuracy reduction ranged from 2.8% to 18.2% relative to unmasked conditions, except for 
surgical masks. The study demonstrated that most mask types yielded similar accuracy in low-noise 
backgrounds, whereas the differences among the masks were more pronounced in high-noise backgrounds. 


3. DATABASE FOR FACE MASKED ARABIC SPEECH 

The main issue facing the research was the unavailability of a face-masked Arabic speech database. 
Standard speech databases and Arabic academic sources did not provide the required face masked speech 
recorded under restricted conditions (the restriction is very important to achieve a fair comparison). 
Producing such a database took the researchers considerable time and effort. The difficulty 
is due to middle-eastern culture, privacy, and security matters. Recording female speech was more 
difficult than recording male speech. 

The researchers have built the required face masked/unmasked Arabic speech database with both 
genders and a wide age range of speakers. The female and male speakers were divided into five age groups: 
under 12, 12-18, 18-40, 40-55, and older than 55 years old (Figure 1(a)). For each age group, 6-10 persons 
recorded their speech. About 50% of them are female, and the others are male. All of them recorded 
the same Arabic (Iraqi accent) counting sentence: “wahid, thnain, thelatha, arba’a, khemsa, cita, seba’a, 
thamanya, tissa’a, eshra, and hde’ash”, which means counting from one to eleven (Figure 1(b)). To get more 
reliable results and conclusions, this article’s Arabic database has been compared subjectively and 
objectively with standard acoustic databases such as the “TIMIT acoustic-phonetic continuous speech 
corpus”. Our database can be regarded as a small, reliable Arabic database of face masked/unmasked speech 
for different ages and genders. 

Each person covered the mouth, nose, and jaw with a medical mask. A PC sound card with 
two channels (stereo) and a 3.5 mm audio jack was used for the recording. Two microphones are connected 
at the other end of the audio cable. The microphones are identical in installation and brand. The first 
microphone is connected to the left channel, and the second is connected to the right. 
The first microphone is located under the mask (it records the face-unmasked speech) and the second is located 
outside the mask (it records the face masked speech). Different types of microphones were tested, and the 
least noisy ones were chosen and installed with similar cable lengths and terminal types (Figure 1(c)). 

Each person said the counting sentence only once, so the sound card recorded the two speech 
signals (face-unmasked and face masked) simultaneously. The recorded speech is saved as a “.wav” 
two-channel (stereo) audio file. The sampling rate of the recorded speech is 16 kHz with a resolution of 
16 bits/sample. Depending on the talker, the duration of the counting sentence is 10 to 15 seconds (s). 
Before processing, long durations of silence are removed manually. Recorded speech that suffered from 
clipping is deleted and then re-recorded carefully. Subjectively, the envelopes of the normalized speech 
signals are clearly similar, but their sample-by-sample amplitudes differ (Figure 1(b)). The differences are 
due to the physical effects of the face mask. The Adobe-Audition® and Audacity® audio signal processing 
applications were used for recording, playing, editing, spectrogram displaying, frequency-domain analyzing, 
subjective testing, and visual plotting. 
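
For readers who want to process such recordings outside the audio editors, a minimal sketch of separating the two channels of a 16-bit stereo “.wav” file is given below, using only the Python standard library. The function name and the tiny in-memory demo file are illustrative assumptions; the channel assignment (left for unmasked, right for masked) follows the microphone setup described above.

```python
import io
import struct
import wave

def read_stereo_wav(source):
    """Read a 16-bit stereo WAV file; return (fs, left, right) sample lists."""
    with wave.open(source, "rb") as wf:
        if wf.getnchannels() != 2 or wf.getsampwidth() != 2:
            raise ValueError("expected a 16-bit, 2-channel file")
        fs = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    count = len(raw) // 2                       # number of 16-bit values
    values = struct.unpack("<%dh" % count, raw)
    # Stereo samples are interleaved: left, right, left, right, ...
    return fs, list(values[0::2]), list(values[1::2])

# Demo on a tiny in-memory file (three stereo frames, made-up sample values).
demo = io.BytesIO()
with wave.open(demo, "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<6h", 100, 60, -200, -120, 300, 180))
demo.seek(0)
fs, unmasked, masked = read_stereo_wav(demo)    # left: unmasked, right: masked
```

A file path can be passed instead of the in-memory buffer, since `wave.open` accepts both.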

Before the main process to detect F0, the recorded “.wav” files were tested in the time and 
frequency domains by using the Audacity® and Adobe-Audition® applications. Those domains reflect the 
main views of the recorded speech. Secondary parameters such as noise, silence, and tones were tested 
subjectively in the same applications. The analysis features were used to check the spectrum, the contrast, 
and the clipping. In more detail, the “tools” tag was used for cross-checking between the two channels of 
each speaker, between the speech of different genders in each channel, and then among speakers of the same 
gender for each audio channel. 




Figure 1. Recording database: (a) 14 years old male speaker; (b) a typical sample of stereo unmasked and 
masked speech of Arabic for counting from 1 to 11; and (c) installation of 3.5 mm 2-channel identical 
microphones; the first is for unmasked speech and the second is for masked 


4. EXPERIMENTS AND SUBJECTIVE TESTS 

By running the Matlab® programs of the RAPT algorithm, F0 is estimated for the voiced speech. 
The unvoiced speech (without F0) is also detected. The instants of the voiced speech (with its F0) and of the 
unvoiced speech are precisely extracted. Besides the Matlab integrated development environment (IDE), 
Notepad++® is used to edit the required speech parameters (e.g., sampling rate) in the source file, and 
Audacity® plays and displays the 2-channel speech signals. 

The RAPT default values of Talkin and Kleijn [15] and Gonzalez and Brooks [29] are: Hanning 
window, sampling rate of 16 kHz, minimum F0 of 50 Hz, maximum F0 of 500 Hz, frame time of 10 ms, 
low pass (LP) filter window size of 5 ms, correlation window size of 7.5 ms, minimum peak in the 
normalized cross-correlation function of 0.3, taper factor of 0.3 (linear lag), F0 change cost factor of 0.02, 
transition cost of 0.005 (voice state fixing), transition cost of 0.5 for delta modulation, bias for 
encouraging voice of 0 (hypotheses), doubling/halving cost of 0.35 for exact values, noise level of 0 for 
absolute RMS, a level relative of noise (RMS noise) floor of 2, SNR of 0.001 (peak S to floor R), window 
length of 30 ms (RMS measurement), window spacing of 20 ms (RMS measurement), maximum hypothesis 
for each frame of 20, position in s-plane of -7000 (pre-emphasis 0.0), and the number of full lags to try of 7. 
These default parameters were changed to suit the masked/unmasked speech signals of this 
research. The sampling rate remained at the standard 16 kHz with 2 bytes per sample (16-bit resolution for 
each sample of the two observation signals). The F0 range was widened to cover the 40 Hz to 600 Hz band. 
The LP filter window was 5 ms, with an additional 2.5 ms for the correlation-window size (i.e., 7.5 ms). 
Most of the other parameters were changed by ±25% of the standard default values. The length of the main 
window (speech frame) was 25 ms to 40 ms with 15 ms to 30 ms spacing. 
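
To make the adapted settings concrete, the short sketch below converts the durations quoted above into sample counts at 16 kHz and derives the lag range implied by the 40 Hz to 600 Hz search band. The dictionary keys and the helper name are hypothetical and do not correspond to the parameter names of the Matlab RAPT program.

```python
import math

FS = 16000  # sampling rate in Hz, as stated in the text

def ms_to_samples(duration_ms, fs=FS):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(duration_ms * fs / 1000)

# Adapted values quoted in the text (key names are illustrative only).
adapted = {
    "f0_min_hz": 40.0,
    "f0_max_hz": 600.0,
    "lp_window_ms": 5.0,
    "corr_window_ms": 7.5,
    "frame_ms": (25.0, 40.0),      # range used for the main window
    "spacing_ms": (15.0, 30.0),    # range used for the window spacing
}

# Lags (in samples) that cover the F0 search band.
lag_min = math.ceil(FS / adapted["f0_max_hz"])    # shortest period
lag_max = math.floor(FS / adapted["f0_min_hz"])   # longest period

corr_window = ms_to_samples(adapted["corr_window_ms"])
frame_range = tuple(ms_to_samples(v) for v in adapted["frame_ms"])
```

Widening the band from 50-500 Hz to 40-600 Hz lengthens the longest lag from 320 to 400 samples, so each frame must provide at least that many samples beyond the correlation window.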

The time-domain plots in graph mode illustrate the lag candidates and the F0 (larynx frequency) 
of the voiced speech frame by frame. For unvoiced speech and/or silence, “nan” is returned. 
For each candidate F0, the start and/or end samples of the frame are returned, together with a flag at the 
beginning of each speech spurt (Figure 2(a)). Suggested improvements and known bugs include: a backward 
dynamic programming (DP) pass with true-cost output for any F0 candidates; discrimination between the 
silent and the voiceless states; and an improved DP with long-term penalties, e.g., for doubled or halved F0. 
After the necessary implementations, the resulting data were collected. The subjective tests of the data 
indicate that: 

— More than 3 dB attenuation in the signal energy of the face masked speech. 
— The attenuation and the noisy-condition effects are similar for both genders. 
— Age groups did not affect the noisy background and the attenuation. 
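
The attenuation noted in the first observation can be expressed numerically as the RMS level difference between the two channels. The following is a minimal sketch, assuming two time-aligned sample lists (unmasked and masked) of the same utterance; the function names are hypothetical and this is not the procedure used for the subjective tests.

```python
import math

def rms(samples):
    """Root mean square value of a list of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def attenuation_db(unmasked, masked):
    """Level of the unmasked channel relative to the masked one, in dB.

    A positive result means the masked channel is weaker. Because energy is
    proportional to the squared RMS, 20*log10 of the RMS ratio equals
    10*log10 of the energy ratio.
    """
    return 20.0 * math.log10(rms(unmasked) / rms(masked))
```

For example, a masked channel with half the amplitude of the unmasked one gives about 6 dB of attenuation.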

The above subjective tests gave us rough indications of the time-domain form of the signals 
for the Arabic masked and unmasked speech of the two genders and the five age groups. 
The tests did not give exact numerical values for the attenuation or for the SNR of the two signals, nor 
comparisons between these ratios. Figure 2(a) illustrates the low-noise background and Figure 2(b) the 
high-noise background. Since the subjective tests did not provide enough evaluation to decide the effects 
of the medical mask on the F0 of Arabic speech, the researchers used two approaches for that task. The first 
approach is the standard and non-standard objective tests. The second approach is a mathematical model 
that supports the objective tests. For that, the researchers used statistical analysis, because it offers different 
useful parameters. The next section gives more details about the analysis and the standard objective tests. 

Figure 2. Typical graphs for face masked speech: (a) on the low-noise background and (b) on the high-noise 
background 


5. OBJECTIVE TESTS 

The subjective tests of the experiments’ results did not indicate the exact numerical values of the F0 
increase/decrease shifts due to mask-wearing. The SNR also cannot be calculated by subjective tests. 
Using the following statistical process, objective tests are invoked to investigate the effects of mask-wearing on 
F0 for the above 5 age groups of female/male speakers. The following standard and non-standard 
objective-test criteria are measured to compare the F0 of the unmasked and face masked speech [30]: 
i) MnF0, the minimum of the F0 minima of the conversations of each age group; ii) MxF0, the maximum of 
the F0 maxima of the conversations of each age group; and iii) the mean (M), the average of the F0 
averages of the conversations of each age group. 
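
The three criteria can be sketched as follows for one age group, assuming each conversation is given as a list of frame-wise F0 values in Hz with NaN marking unvoiced frames (as returned by the pitch tracker). The function name is hypothetical.

```python
import math

def group_summary(conversations):
    """Return (MnF0, M, MxF0) for one age group.

    conversations: list of F0 tracks, one per conversation; NaN = unvoiced.
    """
    voiced = [[f for f in track if not math.isnan(f)] for track in conversations]
    voiced = [track for track in voiced if track]        # drop all-unvoiced tracks
    mn_f0 = min(min(track) for track in voiced)          # minimum of the minima
    mx_f0 = max(max(track) for track in voiced)          # maximum of the maxima
    means = [sum(track) / len(track) for track in voiced]
    m = sum(means) / len(means)                          # average of the averages
    return mn_f0, m, mx_f0
```

Note that M is an average of per-conversation averages, so each conversation carries equal weight regardless of its number of voiced frames.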

The data listed in Table 1 (for females) and Table 2 (for males) confirm that F0 is higher for 
females than for males in both the masked and the unmasked speech. The tables provide 
good indications of the minimum, average, and maximum F0s, but do not clarify the overall details of the 
other F0s; see Figure 3(a) for females and Figure 3(b) for males. For that, statistical analysis is exploited. 


Statistical analysis for the pitch of mask-wearing Arabic speech (Hasan M. Kadhim) 


ISSN: 1693-6930 


Table 1. The minimum of F0 minima (MnF0), the average of F0 averages (M), and the maximum of F0 
maxima (MxF0) for unmasked (U) and masked (MSK) speech of the 5 female age groups of speakers; F0 in Hz 

Age     < 12 years old    12-18 years old   18-40 years old   40-55 years old   55 years old < age
        U       MSK       U       MSK       U       MSK       U       MSK       U       MSK
MnF0    51      52        51      52        50      51        51      50        51      51
M       250               212               199               175               149
MxF0    445     457       438     448       436     444       433     435       431     432


Table 2. The minimum of F0 minima (MnF0), the average of F0 averages (M), and the maximum of F0 
maxima (MxF0) for unmasked (U) and masked (MSK) speech of the 5 male age groups of speakers; F0 in Hz 

Age     < 12 years old    12-18 years old   18-40 years old   40-55 years old   55 years old < age
        U       MSK       U       MSK       U       MSK       U       MSK       U       MSK
MnF0    51      50        51      50        51      50        50      51        51      51
M       212               195               133               116               100
MxF0    345     332       335     343       328     321       326     325       323     326


Figure 3. For unmasked and masked speech: (a) for female speakers and (b) for male speakers 


a) Probability density function (PDF) of F0: by using histogram calculations, the F0 of the unmasked 
and face masked speech is illustrated in Figure 4(a) for female speakers and Figure 4(b) for male 
speakers. The F0 range is from 50 to 500 Hz. The subjective test indicates that the F0 distribution for 
males has more variance than the female F0 distribution. Male F0 is concentrated on 
the higher frequencies, while female F0 is concentrated on the lower. 

Subjectively, most conversations in the recorded database have two major lobes and many minor 
lobes in their PDF distributions. The first major lobe, at the lower frequencies, has 
less energy than the second major lobe at the higher frequencies. The F0 distribution concentrates on those 
two major lobes, i.e., most of the F0 energy is located inside them. 
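
A histogram-based PDF estimate of this kind can be sketched as below. The 10 Hz bin width and the function name are assumptions, since the bin width used for Figure 4 is not stated; unvoiced frames (NaN) and values outside 50 Hz to 500 Hz are ignored.

```python
import math

def f0_pdf(f0_values, lo=50.0, hi=500.0, bin_width=10.0):
    """Histogram estimate of the F0 PDF; the values integrate to 1 over [lo, hi]."""
    n_bins = int(round((hi - lo) / bin_width))
    counts = [0] * n_bins
    for f in f0_values:
        if math.isnan(f) or f < lo or f > hi:
            continue                                   # unvoiced or out of range
        index = min(int((f - lo) // bin_width), n_bins - 1)
        counts[index] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    # Divide by the bin width so that sum(pdf) * bin_width == 1.
    return [c / (total * bin_width) for c in counts]
```

Two major lobes in such a histogram show up as two groups of adjacent bins with clearly larger values than their neighbors.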



Figure 4. Probability density function (PDF) of F0 for the unmasked and face masked speech: (a) female and 
(b) male 

For the original unmasked speech, the cumulative distribution functions (CDF) are calculated from their 
corresponding PDFs and illustrated in Figure 5(a) for females and Figure 5(b) for males. The CDF reaches 
0.5 at the median of the distribution, which coincides with the average F0 when the PDF is symmetric; 
the average F0 of the speech (M) is used here as the dividing frequency. The F0 values below the 
average are considered the lower band of the speech, from 50 Hz to M. The F0 values above the 
average are considered the upper band of the speech, from M to 500 Hz. 
Since the PDFs of most conversations in the recorded database have two major lobes, 
their CDFs show two main rises along the frequency axis. The second rise is greater than the first 
because the energy content of the second lobe is higher than that of the first. 

Figure 5. Normalized PDF and its CDF of F0 for the unmasked speech: (a) for a female face unmasked 
speaker and (b) for a male face unmasked speaker 
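
The CDF computation and the split of the F0 values at the average M can be sketched as follows, assuming a PDF sampled on equal-width bins and a list of voiced F0 values. The function names are hypothetical.

```python
def cdf_from_pdf(pdf, bin_width):
    """Cumulative sum of a binned PDF; the last value is 1 for a normalized PDF."""
    cdf, running = [], 0.0
    for p in pdf:
        running += p * bin_width
        cdf.append(running)
    return cdf

def split_by_mean(f0_values):
    """Split voiced F0 values into the lower band (< M) and upper band (>= M)."""
    m = sum(f0_values) / len(f0_values)
    lower = [f for f in f0_values if f < m]
    upper = [f for f in f0_values if f >= m]
    return m, lower, upper
```

Each rise of the CDF corresponds to one lobe of the PDF, and its height equals the share of F0 values inside that lobe.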


The classification error (CE) is the percentage of frames whose voicing decision changes: the frames 
that are unvoiced in the unmasked speech but classified as voiced in the face masked speech, together with 
the frames that are voiced in the unmasked speech but classified as unvoiced in the face masked speech. 
The CE is a standard objective test for PDA algorithms. The CE percentage (%) tests are tabulated in 
Table 3 and illustrated in Figure 6(a). 

The gross error (GE) is the percentage of voiced-speech frames in which the estimated pitch value 
(for the face masked speech) deviates by more than 20% from the reference pitch value (for the unmasked 
speech). The gross error is a standard objective test for PDA algorithms. The percentage gross error 
(GE%) tests are tabulated in Table 3 and illustrated in Figure 6(b). More details are given in the next 
paragraphs. 
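
Both error measures can be sketched frame by frame as shown below, assuming two equally long F0 tracks with NaN marking unvoiced frames: the unmasked track as the reference and the masked track as the test. The function names are hypothetical, and taking GE over the frames voiced in both tracks is an assumption about the normalization.

```python
import math

def classification_error(reference, test):
    """CE%: share of frames whose voiced/unvoiced decision differs."""
    mismatches = sum(
        1 for r, t in zip(reference, test) if math.isnan(r) != math.isnan(t)
    )
    return 100.0 * mismatches / len(reference)

def gross_error(reference, test, tolerance=0.20):
    """GE%: share of commonly voiced frames deviating by more than tolerance."""
    voiced = [
        (r, t) for r, t in zip(reference, test)
        if not math.isnan(r) and not math.isnan(t)
    ]
    if not voiced:
        return 0.0
    gross = sum(1 for r, t in voiced if abs(t - r) > tolerance * r)
    return 100.0 * gross / len(voiced)
```

Because both channels are recorded simultaneously, the frames of the two tracks are already aligned and no time warping is needed before the comparison.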

The researchers proposed the following procedure to examine the details of F0 in the frequency 
domain. The F0 range of 50 Hz to 500 Hz is divided into two bands: the lower band (LB) from 50 Hz to the 
average F0 (M), and the upper band (UB) from the average F0 (M) to 500 Hz. For the five female and male 
age groups of speakers, Table 4 and Table 5 contain the comparison measurements of these groups with 
respect to gender and sub-band. Figure 7 (for females) and Figure 8 (for males) illustrate these data 
(Figure 7(a) and Figure 8(a) for the lower band, and Figure 7(b) and Figure 8(b) for the upper band). 
According to these experimental measurements: 

— Most of the lower-band F0 (55% to 70%) remained unchanged for different ages and genders. The 
average of them is about 40%. 

— Most of the upper-band F0 (about 70%) remained unchanged for males older than 12 years old. For 
males less than 12 years old, there is a similarity with the female upper-band F0 changes. 

— For different female ages, the unchanged upper-band F0s fluctuate from 60% to 73%. 

— For the lower band, the number of F0 increases ranges from 15% to 25% of the total F0 for 
different ages and genders. Its average is about 20%. 

— The average increase of the F0 value is about 22% of the original F0 of the unmasked speech. 

— For the lower band, the number of F0 decreases ranges from 15% to 22% of the total F0 for 
different ages and genders. 

— The decrease of the F0 value is less than 18% of the original F0 of the unmasked speech. 

— For the upper band, the number of F0 increases and decreases is less than 16% of the total F0 for 
different ages and genders. 

Table 3. The percentage CE, the percentage GE, and the mean value (M) of F0 for face masked speech of the 
5 female (F) and male (M) age groups of speakers. Their original unmasked speech is the reference 

Age     < 12 years old    12-18 years old   18-40 years old   40-55 years old   55 years old < age
        F       M         F       M         F       M         F       M         F       M
CE%     12%     10%       11%     8%        8%      1%        8%      1%        71%     1%
GE%     42%     44%       41%     42%       40%     41%       39%     39%       38%     38%
M (Hz)  250     212       212     195       199     133       175     116       149     100
Figure 6. The objective tests of F0 for face masked speech of the 5 age groups of speakers: (a) the CE% and 
(b) the GE%. Their original unmasked speech is the reference. In each age group, the first bar is for 
females and the second bar is for males 


Figure 7. Comparison between the LB and the UB of the 5 female age groups: (a) for the LB and (b) for the UB. 
In each age group, the bars are, in order: EQL, INC, INCV, DEC, and DECV 


Table 4. Comparison between the LB and the UB of the 5 female age groups 

                < 12 years old    12-18 years old   18-40 years old   40-55 years old   55 years old < age
EQL        LB   69%               65%               70%               66%               69%
           UB   55%               60%               57%               56%               64%
INC        LB   16%               18%               16%               18%               16%
           UB   23%               11%               25%               25%               15%
INCV (PU)  LB   0.14              0.20              0.13              0.15              0.10
           UB   0.12              0.10              0.16              0.14              0.10
DEC        LB   15%               17%               14%               16%               15%
           UB   31%               18%               27%               30%               18%
DECV (PU)  LB   0.11              0.14              0.16              0.11              0.08
           UB   0.16              0.15              0.13              0.18              0.11


Figure 8. Comparison between the LB and the UB of the 5 male age groups: (a) for the LB and (b) for the UB 


EQL is the percentage of unchanged F0. INC and DEC are the percentages of F0 increases and decreases, 
respectively. INCV and DECV are the per-unit (PU) values of the F0 increases and decreases, respectively. 
The calculations are per-unit relative to the number of F0 values in the sub-band. 


Table 5. Comparison between the LB and the UB of the 5 male age groups 

                < 12 years old    12-18 years old   18-40 years old   40-55 years old   55 years old < age
EQL        LB   65%               60%               61%               61%               62%
           UB   70%               71%               73%               72%               72%
INC        LB   18%               21%               19%               21%               20%
           UB   20%               12%               15%               13%               13%
INCV (PU)  LB   0.13              0.16              0.14              0.17              0.14
           UB   0.17              0.10              0.15              0.13              0.12
DEC        LB   17%               19%               20%               18%               18%
           UB   15%               18%               12%               15%               15%
DECV (PU)  LB   0.14              0.16              0.20              0.19              0.16
           UB   0.09              0.11              0.06              0.10              0.08


EQL is the percentage of unchanged F0. INC and DEC are the percentages of F0 increases and decreases, 
respectively. INCV and DECV are the values of the F0 increases and decreases, respectively. The 
calculations are per-unit (PU) relative to the number of F0 values in the sub-band. 
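
One possible way to compute these five quantities for a sub-band is sketched below. The exact definitions are not given in the text, so two assumptions are made and should be read as such: a frame counts as unchanged when the masked and unmasked F0 differ by at most 1 Hz, and INCV/DECV are taken as the mean relative change of the increased/decreased frames. The function name is hypothetical.

```python
def band_changes(unmasked_f0, masked_f0, tol_hz=1.0):
    """Return EQL, INC, DEC (in %) and INCV, DECV (per unit) for one sub-band.

    unmasked_f0, masked_f0: paired voiced F0 values (Hz) of the same frames.
    tol_hz: assumed tolerance below which F0 is treated as unchanged.
    """
    pairs = list(zip(unmasked_f0, masked_f0))
    n = len(pairs)
    # Relative size of each increase or decrease, per unit of the unmasked F0.
    increases = [(m - u) / u for u, m in pairs if m - u > tol_hz]
    decreases = [(u - m) / u for u, m in pairs if u - m > tol_hz]
    unchanged = n - len(increases) - len(decreases)
    return {
        "EQL": 100.0 * unchanged / n,
        "INC": 100.0 * len(increases) / n,
        "DEC": 100.0 * len(decreases) / n,
        "INCV": sum(increases) / len(increases) if increases else 0.0,
        "DECV": sum(decreases) / len(decreases) if decreases else 0.0,
    }
```

Under these definitions EQL, INC, and DEC always sum to 100% within a sub-band, which gives a quick consistency check on tabulated values.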


6. CONCLUSION 

Regarding the effect of mask-wearing on Arabic speech F0, the above data, tables, and 
graphs show that F0 changed less for males than for females. The change is smaller for older than for younger 
males and females. The change is less significant for the low-frequency F0 than for the high-frequency F0 
within the F0 bandwidth from 50 Hz to 500 Hz. The F0 changes in females younger than 12 years old are 
fewer than in similarly-aged males. The F0 changes in females older than 12 years old were approximately 
equal to those of similarly-aged males. The probability density function and cumulative distribution function 
of F0 for different ages and genders shift little due to mask-wearing. 

For future research, the study could be expanded by using several F0 detection algorithms, such 
as YAAPT, PRAAT, YIN, and/or STRAIGHT (these algorithms and applications are described in the 
introduction section of this article). In this research, the F0 band has been divided into two sub-bands 
according to the average value of F0 for each conversation. The F0 band could instead be divided into several 
sub-bands by using a standard filter-bank scheme or a wavelet decomposition of the frequency 
domain of the F0 band. As for the mathematical model, other models could be used instead of the 
statistical model (e.g., a stochastic model). Other statistical parameters, such as the fourth-order central 
moment (the kurtosis) of the analyzed data, could further support the results of the research. 